Boosting the Basic Counting on Distributed Streams
We revisit the classic basic counting problem in the distributed streaming
model that was studied by Gibbons and Tirthapura (GT). In the solution for
maintaining an -estimate, as GT's method does, we make
the following new contributions: (1) For a bit stream of size , where each
bit has a probability at least to be 1, we exponentially reduced the
average total processing time from GT's to
, thus providing the first
sublinear-time streaming algorithm for this problem. (2) In addition to an
overall much faster processing speed, our method provides a new tradeoff that a
lower accuracy demand (a larger value for ) promises a faster
processing speed, whereas GT's processing speed is
in any case and for any . (3) The worst-case total time cost of our
method matches GT's , which is necessary but rarely
occurs in our method. (4) The space usage overhead in our method is a lower
order term compared with GT's space usage and occurs only times
during the stream processing and is too negligible to be detected by the
operating system in practice. We further validate these solid theoretical
results with experiments on both real-world and synthetic data, showing that
our method is faster than GT's by a factor of several to several thousands
depending on the stream size and accuracy demands, without any detectable space
usage overhead. Our method is based on a faster sampling technique that we
design for boosting GT's method and we believe this technique can be of other
interest. Comment: 32 pages
On Longest Repeat Queries Using GPU
Repeat finding in strings has important applications in subfields such as
computational biology. The challenge of finding the longest repeats covering
particular string positions was recently proposed and solved by İleri et
al., using a total of the optimal time and space, where is the
string size. However, their solution can only find the leftmost longest
repeat for each string position. It is also not known how to
parallelize their solution. In this paper, we propose a new solution for
longest repeat finding, which, although theoretically suboptimal in time, is
conceptually simpler, runs faster, and uses less memory space in practice
than the optimal solution. Further, our solution can find all longest
repeats of every string position, while still maintaining a faster processing
speed and less memory space usage. Moreover, our solution is
parallelizable in the shared memory architecture (SMA), enabling it to
take advantage of the modern multi-processor computing platforms such as the
general-purpose graphics processing units (GPU). We have implemented both the
sequential and parallel versions of our solution. Experiments with both
biological and non-biological data show that our sequential and parallel
solutions are faster than the optimal solution by a factor of 2--3.5 and 6--14,
respectively, and use less memory space. Comment: 14 pages
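To make the distinction between the leftmost longest repeat and all longest repeats concrete, the following is a minimal brute-force sketch of the problem definition only. It is not the algorithm from the paper; the function names and the 0-based inclusive interval convention are assumptions made for illustration, and the running time is far from the bounds discussed in the abstract.

```python
def count_occurrences(s, sub):
    """Count occurrences of sub in s, allowing overlaps."""
    count = 0
    start = s.find(sub)
    while start != -1:
        count += 1
        start = s.find(sub, start + 1)
    return count


def longest_repeats_covering(s, k):
    """Return every longest repeat covering position k.

    A repeat is a substring that occurs at least twice in s.
    Results are (start, end) pairs, 0-based and inclusive, sorted
    from left to right; the first pair is the leftmost one.
    """
    n = len(s)
    # Try candidate lengths from longest to shortest.
    for length in range(n, 0, -1):
        hits = []
        first = max(0, k - length + 1)   # earliest start that still covers k
        last = min(k, n - length)        # latest start that still covers k
        for i in range(first, last + 1):
            if count_occurrences(s, s[i:i + length]) >= 2:
                hits.append((i, i + length - 1))
        if hits:
            return hits
    return []  # no repeat covers position k


if __name__ == "__main__":
    # In "banana", position 3 is covered by two longest repeats,
    # both spelling "ana": one at [1, 3] and one at [3, 5].
    print(longest_repeats_covering("banana", 3))
```

A leftmost-only solution would report just the first pair, whereas an all-repeats solution reports every pair in the returned list.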
CloudTree: A Library to Extend Cloud Services for Trees
In this work, we propose a library that enables on a cloud the creation and
management of tree data structures from a cloud client. As a proof of concept,
we implement a new cloud service CloudTree. With CloudTree, users are able to
organize big data into tree data structures of their choice that are physically
stored in a cloud. We use caching, prefetching, and aggregation techniques in
the design and implementation of CloudTree to enhance performance. We have
implemented the services of Binary Search Trees (BST) and Prefix Trees as
current members in CloudTree and have benchmarked their performance using the
Amazon Cloud. The ideas and techniques in the design and implementation of the BST
and prefix tree are generic and thus can also be used for other types of trees
such as B-trees, and other link-based data structures such as linked lists and
graphs. Preliminary experimental results show that CloudTree is useful and
efficient for various big data applications.
Shortest Unique Substring Query Revisited
We revisit the problem of finding shortest unique substring (SUS) proposed
recently by [6]. We propose an optimal time and space algorithm that can
find an SUS for every location of a string of size . Our algorithm
significantly improves the time complexity needed by [6]. We also
support finding all the SUSes covering every location, whereas the solution in
[6] can find only one SUS for every location. Further, our solution is simpler
and easier to implement and can also be more space efficient in practice, since
we only use the inverse suffix array and longest common prefix array of the
string, while the algorithm in [6] uses the suffix tree of the string and other
auxiliary data structures. Our theoretical results are validated by an
empirical study that shows our algorithm is much faster and more space-saving
than the one in [6].
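As an illustration of what an SUS covering a location means, here is a minimal brute-force sketch of the definition. It does not use the inverse suffix array or longest common prefix array mentioned in the abstract and is not the authors' algorithm; the function names and the 0-based inclusive intervals are assumptions made for this example.

```python
def count_occurrences(s, sub):
    """Count occurrences of sub in s, allowing overlaps."""
    count = 0
    start = s.find(sub)
    while start != -1:
        count += 1
        start = s.find(sub, start + 1)
    return count


def shortest_unique_substrings_covering(s, k):
    """Return every shortest unique substring (SUS) covering position k.

    A substring is unique if it occurs exactly once in s.
    Results are (start, end) pairs, 0-based and inclusive.
    """
    n = len(s)
    # Try candidate lengths from shortest to longest.
    for length in range(1, n + 1):
        hits = []
        first = max(0, k - length + 1)   # earliest start that still covers k
        last = min(k, n - length)        # latest start that still covers k
        for i in range(first, last + 1):
            if count_occurrences(s, s[i:i + length]) == 1:
                hits.append((i, i + length - 1))
        if hits:
            return hits
    return []  # unreachable for non-empty s and valid k: s itself is unique


if __name__ == "__main__":
    # In "banana", the only SUS covering position 3 is "nan" at [2, 4].
    print(shortest_unique_substrings_covering("banana", 3))
```

Returning a list rather than a single interval mirrors the distinction drawn above between finding one SUS and finding all SUSes for a location.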
Time-decaying Sketches for Robust Aggregation of Sensor Data
We present a new sketch for summarizing network data. The sketch has the following properties which make it useful in communication-efficient aggregation in distributed streaming scenarios, such as sensor networks: the sketch is duplicate insensitive, i.e., reinsertions of the same data will not affect the sketch and hence the estimates of aggregates. Unlike previous duplicate-insensitive sketches for sensor data aggregation [S. Nath et al., Synopsis diffusion for robust aggregation in sensor networks, in Proceedings of the 2nd International Conference on Embedded Network Sensor Systems, (2004), pp. 250–262], [J. Considine et al., Approximate aggregation techniques for sensor databases, in Proceedings of the 20th International Conference on Data Engineering (ICDE), 2004, pp. 449–460], it is also time decaying, so that the weight of a data item in the sketch can decrease with time according to a user-specified decay function. The sketch can give provably approximate guarantees for various aggregates of data, including the sum, median, quantiles, and frequent elements. The size of the sketch and the time taken to update it are both polylogarithmic in the size of the relevant data. Further, multiple sketches computed over distributed data can be combined without loss of accuracy. To our knowledge, this is the first sketch that combines all the above properties.
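The two properties of duplicate insensitivity and lossless merging can be illustrated with a much simpler, well-known structure: a k-minimum-values (KMV) distinct-count sketch. This is a swapped-in technique, not the sketch proposed in the paper, and it has no time decay; the class name, the parameter k, and the choice of SHA-256 as the hash are assumptions made for the example.

```python
import hashlib


class KMVSketch:
    """Keeps the k smallest hash values seen so far (illustrative only)."""

    def __init__(self, k=64):
        self.k = k
        self.values = set()

    @staticmethod
    def _hash(item):
        # Map an item deterministically to a float in [0, 1).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, item):
        # Re-adding an item yields the same hash, so the set is unchanged.
        self.values.add(self._hash(item))
        if len(self.values) > self.k:
            self.values.remove(max(self.values))

    def merge(self, other):
        # The k smallest values of the union can be recovered from the
        # k smallest values of each part, so merging loses nothing.
        merged = KMVSketch(self.k)
        merged.values = set(sorted(self.values | other.values)[: self.k])
        return merged

    def estimate(self):
        # Exact while fewer than k distinct hashes have been seen.
        if len(self.values) < self.k:
            return float(len(self.values))
        return (self.k - 1) / max(self.values)


if __name__ == "__main__":
    left, right = KMVSketch(), KMVSketch()
    for reading in range(0, 600):
        left.add(reading)
    for reading in range(400, 1000):   # overlaps with the left node's data
        right.add(reading)
    print(left.merge(right).estimate())  # rough estimate of 1000 distinct items
```

Because overlapping readings hash to the same values, the merged sketch is identical to one built centrally over all the data, which is the behavior the abstract describes as combining sketches without loss of accuracy.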
Charge transport in nanoscale vertical organic semiconductor pillar devices
We report charge transport measurements in nanoscale vertical pillar
structures incorporating ultrathin layers of the organic semiconductor
poly(3-hexylthiophene) (P3HT). P3HT layers with thickness down to 5 nm are
gently top-contacted using wedging transfer, yielding highly reproducible,
robust nanoscale junctions carrying high current densities (up to
A/m). Current-voltage data modeling demonstrates excellent hole injection.
This work opens up the pathway towards nanoscale, ultrashort-channel organic
transistors for high-frequency and high-current-density operation. Comment: 30 pages, 8 figures, 1 table